QTM 150 Group 2F Project
The Impact Times Square has on New York City (NYC) AirBnB Prices
Part A
0. AIRBnB NYC Dataset
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble 2.1.3 ✓ purrr 0.3.3
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(leaflet)
library(stringr)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(mapview)
library(htmlwidgets)
library(webshot)
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
##
## plot
library(sp)
library(maptools)
## Checking rgeos availability: TRUE
library(broom)
library(httr)
library(rgdal)
## rgdal: version: 1.4-8, (SVN revision 845)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.4.2, released 2019/06/28
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 5.2.0, September 15th, 2018, [PJ_VERSION: 520]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/rgdal/proj
## Linking to sp version: 1.3-2
AIRBnB <- read.csv("AIRBnB_NYC.csv")
1. Describing Our Project Investigation
After a long investigation of the dataset, we stepped away with one major takeaway: proximity and accesibility to NYC attractions seems to affect listings’ prices a lot. Thus, we wanted to explore this a bit more in depth with our project. We came into this project wondering why the price of each listing can change so dramatically from neighbourhood to neighbourhood, so we wanted our project to be location focused. Our goal with this project is to show that attractions can have a big impact on the price of AirBnB listings. We chose to focus in on NYC’s arguably most notable and visited attraction: Times Square. Within our project we want to investigate the impact Times Square has on listings’ prices. Since the price will also be different according to room type (as found through our investigating), we chose to focus on one room type for consistency: Entire home/apartment.
2. Numerical Response Variable
For our Numerical Response Variable, we chose to use the listings’ price. We want to use price as this variable because we want to investigate the price differences according to location and proximity to Times Square. By using this as the numerical response variable, we can determine what these price differences are. We only included data from Manhattan and Brooklyn because that is where our study is focused.
summary(AIRBnB$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 69.0 106.0 152.7 175.0 10000.0
sd(AIRBnB$price, na.rm = T)
## [1] 240.1542
ggplot(AIRBnB, aes(price, na.rm = T)) +
geom_histogram(aes(y = ..density..),fill = "green", col= "red", alpha = 0.2) +
labs(title="Price Distribution for AirBnB Listings")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3. Outliers of Numerical Response Variable
As can be seen in the histogram above, the distribution is extremely skewed to right, which indicates that there are price outliers that need to be isolated. This means that there are listings with unusually high prices that we may want to take out of our analysis because we want to look at how Times Square affects the majority of listings that are the most price-accessible to the average NYC tourist.
From our initial investigation, we found that room type plays a key role in price (Entire home/apt are priced higher than Private room), so we decided to focus on one room type for our analysis: Entire home/apt. We chose this room type because the majority of listings are Entire home/apt in Manhattan and north Brooklyn (which as we will see in 5. as our main focus).
During our initial investigation, we found that the majority of all the listings were price ranged between 0-350 dollars, making this the ideal range to observe the majority of listings.We decided that we wanted to keep these prices because we want to know price listings look like for the majority of NYC tourists on AirBnB. The result shows a more even distribution of listings.
For the listings that we excluded, we did investigate why they may have been more expensive. We split them into four luxury groups, each with a different price range. Prices in all groups tend to gravitate towards numbers divisible by 50 and 100, which makes sense since AirBnB has a price slider function. We concluded that that these luxury apartments were located in more affluent communities, only available for a small portion of the year (possibly holidays or big events), or offered a higher than usual quality listing (like dream destinations). These distributions are in our appendix at the end of the document.
AIRBnB <- AIRBnB %>% filter(room_type=="Entire home/apt")
AIRBnB_pricekeep <- AIRBnB %>% filter(price<=350)
ggplot(AIRBnB_pricekeep, aes(price, na.rm = T)) +
geom_histogram(aes(y = ..density..),fill = "green", col= "red", alpha = 0.2) +
labs(title="Price Distribution for Entire Home/Apartment Listings from $0-$350")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AIRBnB <- AIRBnB %>% filter(price<=350)
4. Main Explanatory Variable
We made our main explanatory variable our geographical data, longitude and latitude. This is technically two variables, but we need to introduce them both because the information would be unuseful if we were to only use exclusively longitude or latitude. Since location plays such a big role in our analysis, it was important to us to see if we could detect where higher priced AirBnBs tend to be in NYC.
summary(AIRBnB$longitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -74.24 -73.99 -73.96 -73.96 -73.94 -73.72
summary(AIRBnB$latitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.51 40.69 40.73 40.73 40.76 40.91
ggplot(AIRBnB, aes(x=longitude,y=latitude)) +
geom_point() + labs(x="Latitude", y="Longitude", title="Map of AirBnB Listings in NYC")

5. Outliers of Main Explanatory Response Variable
The biggest problem that we saw working with our geographical data was that not all areas within NYC were equally represented on our AirBnB list. We wanted to see a full map of New York City, but many areas are patchy and not as full as other areas.
After further investigation of our data, we found that the majority of our listings were located in Manhattan and Brooklyn. The other three neighbourhood groups didn’t have nearly as many listings, thus we concluded that we needed to exclude these areas and focus on Manhattan and Brooklyn. The other three neighbourhood groups simply did not have big enough samples to accurately compare them with Manhattan and Brooklyn.
After this, we found that South Brooklyn still looked very patchy, so we decided to only focus on North Brooklyn and all of Manhattan because we wanted each area to be equally represented. This also works out with our analysis because we want to focus on the affect Times Square has on listings, and our chosen areas all surround Times Square.
ggplot(AIRBnB, aes(neighbourhood_group)) +
geom_bar() +
labs(x="NYC Neighbourhood Groups", y="Number of Listings", title="Density of Listings for Each Area of NYC")

AIRBnB <- AIRBnB %>% filter(neighbourhood_group=="Manhattan"|neighbourhood_group=="Brooklyn")
ggplot(AIRBnB, aes(x=longitude,y=latitude)) +
geom_point() + labs(x="Latitude", y="Longitude", title="Map of AirBnB Listings in Manhattan and Brooklyn")

AIRBnB_n <- AIRBnB %>% group_by(neighbourhood) %>%
summarize(n=n())
AIRBnB <- merge(AIRBnB,AIRBnB_n) %>% filter(neighbourhood_group=="Manhattan"|n>=250)
ggplot(AIRBnB, aes(x=longitude,y=latitude)) +
geom_point() + labs(x="Latitude", y="Longitude", title="Map of AirBnB Listings in Manhattan and Brooklyn")

6. Relationship Between Variables
The relationship between our variables would be as follows: By mapping out the listings’ prices out with longitude and latitude, we should be able to identify areas where prices tend to be higher. We want to find the location of these higher priced listings. By looking at the pattern of the price levels, we can determine if prices do tend to be higher near Times Square (longitude=40.7580, latitude=-73.9855) or other areas. The higher the density of price is in a certain area, the higher AirBnBs tend to be in that given area, and vice versa. Areas of concentrated high prices indicate the trends we are looking for here.
7. Plot
The following plot shows a temperature map of higher prices across Manhattan and North Brooklyn. The red triangle represents Times Square, so that the observer can see how high prices are influenced by Times Square.
ggplot(AIRBnB, aes(x=longitude,y=latitude)) +
geom_point(aes(color=price),position='jitter') +
stat_density_2d(aes(fill= ..level..),geom="polygon") +
geom_point(aes(x=-73.9855,y=40.7580),pch=24,color="red") +
labs(x="Latitude", y="Longitude", title="Price Level Map of AirBnBs in Manhattan and North Brooklyn",caption="*Times Square Represented by Red Triangle on Map")

8. Pattern
As can be clearly seen in the plot above, there does appear to a few areas where the prices of AirBnBs tend to be higher. It does appear like most of the areas with higher prices are closer to Times Square and probably other NYC attractions. It’s interesting because even though Times Square is located in Manhattan, the areas of higher prices in Brooklyn seem to be closer to Times Square as well.
Interestingly (and what we expected), the higher priced areas tend to surround the outside of Time Square instead of being entirely on Times Square. Our team suspected this may be the case because in our earlier investigation of the dataset, there were less listings in the Times Square area (like the neighbourhood Midtown, where it is located). From our observations, we think this may be because it is a very business heavy neighbourhood (ie. stores, attractions, and hotels take up most of the buildings) and city limitations on AirBnBs are greater in this area (ie. the city wants to prevent all the residential buildings being taken over by tourist housing). Thus, the majority of higher priced AirBnBs are just outside of Times Square. There’s also a patch of high prices in Williamsburg, Brooklyn, which is probably due to the proximity to the Williamsburg Bridge that leads into the Lower East Side, Manhattan.
Part B
1. Third Variable
We chose neighbourhood to be our secondary explanatory variable because it is a more specific version of our first explanatory variable, neighbourhood_group and will allow us to explore the specific neighbourhood price differences. Exploring the average prices of each neighbourhood will allow us to see how location affects price in a broader context than latitude and longitude.
2. Summarizing Third Variable
summary(AIRBnB$neighbourhood)
## Williamsburg Bedford-Stuyvesant
## 1760 1541
## Upper East Side Upper West Side
## 1212 1112
## East Village Hell's Kitchen
## 1064 1015
## Harlem Midtown
## 986 861
## Crown Heights Chelsea
## 727 692
## Bushwick Greenpoint
## 661 595
## West Village Financial District
## 563 493
## Lower East Side East Harlem
## 469 468
## Murray Hill Park Slope
## 358 340
## Clinton Hill Kips Bay
## 317 301
## Washington Heights Fort Greene
## 290 286
## Prospect-Lefferts Gardens Flatbush
## 271 254
## Greenwich Village Gramercy
## 252 216
## Chinatown SoHo
## 192 162
## Morningside Heights Nolita
## 146 145
## Theater District Inwood
## 129 102
## Tribeca Little Italy
## 84 67
## NoHo Flatiron District
## 52 45
## Battery Park City Civic Center
## 38 32
## Two Bridges Roosevelt Island
## 26 18
## Stuyvesant Town Marble Hill
## 11 5
## Allerton Arden Heights
## 0 0
## Arrochar Arverne
## 0 0
## Astoria Bath Beach
## 0 0
## Bay Ridge Bay Terrace
## 0 0
## Bay Terrace, Staten Island Baychester
## 0 0
## Bayside Bayswater
## 0 0
## Belle Harbor Bellerose
## 0 0
## Belmont Bensonhurst
## 0 0
## Bergen Beach Boerum Hill
## 0 0
## Borough Park Breezy Point
## 0 0
## Briarwood Brighton Beach
## 0 0
## Bronxdale Brooklyn Heights
## 0 0
## Brownsville Bull's Head
## 0 0
## Cambria Heights Canarsie
## 0 0
## Carroll Gardens Castle Hill
## 0 0
## Castleton Corners City Island
## 0 0
## Claremont Village Clason Point
## 0 0
## Clifton Co-op City
## 0 0
## Cobble Hill College Point
## 0 0
## Columbia St Concord
## 0 0
## Concourse Concourse Village
## 0 0
## Coney Island Corona
## 0 0
## Cypress Hills Ditmars Steinway
## 0 0
## Dongan Hills Douglaston
## 0 0
## Downtown Brooklyn DUMBO
## 0 0
## Dyker Heights East Elmhurst
## 0 0
## East Flatbush East Morrisania
## 0 0
## East New York Eastchester
## 0 0
## Edenwald (Other)
## 0 0
3. Plot
neighbourhoodaverages <- AIRBnB %>% group_by(neighbourhood) %>%
summarize(price=mean(price),latitude=mean(latitude),longitude=mean(longitude))
ggplot(neighbourhoodaverages, aes(x=longitude, y=latitude, color=price)) +
geom_point() +
labs(x="Latitude", y="Longitude", title="Average Price of AIRBnB Listings by Neighbourhood", caption="*Times Square Represented by Red Triangle on Map") +
geom_point(aes(x=-73.9855,y=40.7580),pch=24,color="red") +
geom_text(aes(label=neighbourhood), check_overlap = T)

ggplot(neighbourhoodaverages, aes(x=longitude, y=latitude, color=price)) +
geom_point() +
labs(x="Latitude", y="Longitude", title="Average Price of AIRBnB Listings by Neighbourhood", caption="*Times Square Represented by Red Triangle on Map") +
geom_point(aes(x=-73.9855,y=40.7580),pch=24,color="red")

4. Pattern
This plot has much fewer observations than the one produced in part A because all of the observations in a neighbourhood were condensed to one point that represented the entire neighbourhood. The plot matched our expectations because the general trend was that neighbourhoods closer to Times Square were more expensive than neighbourhoods farther away from Times Square.
5. Improvements Plot
Since our map looks sparse now that we’re only looking at neighbourhood averages, it is admittedly hard to fully visualize. Even with the marker that represents Times Square, there is no way of seeing the geography like roads and bridges that may better inform about the distance between each neighbourhood and Times Square. It’s also very hard to display each of the neighbourhood names without the plot being difficult to read.
Our first solution is to put the point on a basic map of NYC so that the user can see what the dots on the graph mean. This way it is better to visualize and the user can easily see the proximity each dot has to Times Square.
Our other solution for this is to instead put our points on an interactive map, and we used leaflet to do it! With our improved plot, the user can see the geography of Manhattan and North Brooklyn, where Times Square is located, and can click on each individual circle to see what the neighbourhood name and average price is. If the user wished to zoom in/zoom out on the map to see more detail, they can do so as they please. We also made the average price the size of each circle instead of color so that they can be easily seen on the map.
We think that this improvement would be ideal for users that want to choose AirBnBs based on attributes found in specific neigbourhoods. NYC has a ton of AirBnB listings, so it may be hard for tourists to choose the best neighbourhood for them. This map allows users to pick neighbourhoods based on average price and location.
6. Plot
# the following two plots were made with the help of the tutorial on NYC mapping at https://rpubs.com/jhofman/nycmaps.
lookup_code("New York", "New York")
nyc_tracts <- tracts(state = '36', county = c('061','047','081','005','085'))
r <- GET('http://data.beta.nyc//dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/35dd04fb-81b3-479b-a074-a27a37888ce7/download/d085e2f8d0b54d4590b1e7d1f35594c1pediacitiesnycneighborhoods.geojson')
nyc_neighborhoods <- readOGR(content(r,'text'), 'OGRGeoJSON', verbose = F)
nyc_neighborhoods_df <- tidy(nyc_neighborhoods)
ggplot() +
geom_polygon(data=nyc_neighborhoods_df, aes(x=long, y=lat, group=group)) +
geom_point(data=neighbourhoodaverages, aes(x=longitude, y=latitude, color=price)) +
scale_color_gradient(low="blue", high="red") +
labs(x="Latitude", y="Longitude", title="Average Price of AIRBnB Listings by Neighbourhood", caption="*Times Square Represented by White Triangle on Map") +
geom_point(aes(x=-73.9855,y=40.7580),pch=24,color="white")

options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())
m <- leaflet(neighbourhoodaverages) %>%
addTiles() %>%
addMarkers(lng = -73.9855, lat = 40.7580,
popup = "Times Square") %>%
addCircles(lng= ~longitude, lat= ~latitude, weight=1, radius=~sqrt(price)*30,
popup = paste("Neighbourhood:", neighbourhoodaverages$neighbourhood, "<br>",
"Average Price:", neighbourhoodaverages$price, "<br>"))
m
mapshot(m,file = paste0(getwd(), "/map.png"))
7. Second Explanatory Variable
Our second explanatory variable we are choosing is distance from Time Square. For this variable, we have used the longitude and latitude variables to calculate the distance in miles of different AirBnB’s to Time Square. We are removing all listings from Brooklyn, because the distance measured doesn’t account for bridges and such. We are choosing to use this variable rather than neighborhood groups, because it will allow us to take an even closer look at how proximity to Time Square affects listing prices. While neighborhood groups does show the average prices of neighborhoods around Time Square, it doesn’t allow us to see the specific distance of those groups to Time Square and how they compare to one another. For our new variable we have broken down the distance into five sections. Less than 0.9 miles is considered “very close,” between 0.9 to 1.5 miles is considered “close,” between 1.5 to 2.2 miles is considered “somewhat close,” between 2.2 to 3 miles is considered “not close or far,” and between 3 to 6 miles is considered “very far.” We chose these boundaries so that we had a similar and large enough amount of data in each bin. We also chose to remove all the data points that had a distance farther than 6 miles away from Time Square. There aren’t enough data points compared to the rest of the data set to get an accurate enough sample, so we only focused on AirBnb’s within 6 miles of Time Square.
p <- 0.017453292519943295;
AIRBnB$dlon <- abs(AIRBnB$longitude+73.9855)*p
AIRBnB$dlat <- abs(AIRBnB$latitude-40.7580)*p
AIRBnB$a <- (sin(AIRBnB$dlat/2) * sin(AIRBnB$dlat/2)) + cos(40.7580*p) * cos(AIRBnB$latitude*p) * (sin(AIRBnB$dlon/2) * sin(AIRBnB$dlon/2))
AIRBnB$c <- 2 * atan2(sqrt(AIRBnB$a),sqrt(1-AIRBnB$a))
AIRBnB$Distance <- 3961 * AIRBnB$c # Distance in miles
AIRBnB <- AIRBnB %>% filter(neighbourhood_group=="Manhattan")
AIRBnB$Distance.f <-factor(AIRBnB$Distance,levels=c("Very Close", "Close", "Somewhat Close", "Not Close or Far", "Far"))
AIRBnB$Distance.f[AIRBnB$Distance<=0.9]<-"Very Close"
AIRBnB$Distance.f[AIRBnB$Distance>0.9 & AIRBnB$Distance<= 1.5]<-"Close"
AIRBnB$Distance.f[AIRBnB$Distance>1.5 & AIRBnB$Distance<= 2.2]<-"Somewhat Close"
AIRBnB$Distance.f[AIRBnB$Distance>2.2 & AIRBnB$Distance<= 3]<-"Not Close or Far"
AIRBnB$Distance.f[AIRBnB$Distance>3 & AIRBnB$Distance<= 6]<-"Far"
table(AIRBnB$Distance.f)
##
## Very Close Close Somewhat Close Not Close or Far
## 2028 1953 2541 2274
## Far
## 2495
8. Summary and Plot
As you can see in the summary table and plot for the most part, as distance increases from Time Square, price will decrease. It is interesting that “close” AirBnB’s had a slightly higher price than “very close” AirBnB’s and “not close or far” AirBnB’s had a slighlty higher price than “somewhat close” AirBnB’s. Besides these small discrepancies, there is a clear correlation between distance to Time Square and price. This plot is different than the other plots that we have produced, because it shows the actual correlation between distance and price. In previous plots, we have mapped out the distance and price, but this plot shows the actual relationship with specific distances and price.
averageprice <- AIRBnB %>%
group_by(Distance.f) %>%
summarise(avg=mean(price, na.rm =T)) %>%
filter(Distance.f=="Very Close"|Distance.f=="Close"|Distance.f=="Somewhat Close"|Distance.f=="Not Close or Far"|Distance.f=="Far")
## Warning: Factor `Distance.f` contains implicit NA, consider using
## `forcats::fct_explicit_na`
averageprice
## # A tibble: 5 x 2
## Distance.f avg
## <fct> <dbl>
## 1 Very Close 203.
## 2 Close 199.
## 3 Somewhat Close 189.
## 4 Not Close or Far 185.
## 5 Far 168.
ggplot(averageprice, aes(x= Distance.f, y=avg)) +
geom_bar(color = "green", stat ='identity') +
labs(x="Distance to Time Square", y= "Average Price", title="Average Price Compared to Proximity to Time Square", subtitle="Average Price of Entire Home/Apartment AirBnB in Manhattan")

9. Conclusion
Looking over our data, we found that overall, AirBnB’s closer to Time Square tend to have a higher price. Our first explanatory variable, longitude and latitude, mapped a heat map and showed us that the area around Time square had higher prices. Our next explanatory variable neighborhood showed us the actual prices of neighborhoods around Time Square and we were able to see that prices rose closer to Time Square. Our last explanatory variable allowed us to visualize the actual distance to time square and how the prices compared. These three variables all led to the conclusion that as distance decreases to Time Square, prices of AirBnB’s will rise. This makes sense, because the demand to see one of New York’s finest attractions is very high, so the demand for places to stay nearby is also incredibly high. Since demand is so high and there is only a limited supply of AirBnB’s, the price rises. Tourists can use this information if they want to visit New York and see Time Square. If they’re looking for a more affordable trip, they should look for AirBnB’s farther away from the beautiful attraction. On the other hand, if their priority is to have easy access to Time Square, they can book a more expensive AirBnB nearby. For further analysis in the future, it may be interesting to look at the effect other big NYC attractions have as well the subway system.
Appendix.
Initial Investigation of Dataset
The following plots are from our initial investigation of the dataset. We left it in an appendix incase we refer to it throughout our analysis.
AIRBnB <- read.csv("AIRBnB_NYC.csv")
# number of listings by area of NYC
# the most listings come from Manhattan and Brooklyn (and room types)
ggplot(AIRBnB, aes(neighbourhood_group)) +
geom_bar(aes(fill=room_type)) +
labs(x="NYC Neighbourhood Groups", y="Number of Listings", title="Density of Listings for Each Area of NYC")

# Investigating the median price in Manhattan and Brooklyn
AIRBnB_MB <- AIRBnB %>%
filter(neighbourhood_group=="Manhattan"|neighbourhood_group=="Brooklyn",price<500)
ggplot(AIRBnB_MB, aes(x=neighbourhood_group , y=price, fill=factor(room_type))) +
geom_boxplot() +
labs(x="Top Two Neighbourhood Groups", y="Price ($)", title="Median Price of Manhattan and Brooklyn")

# Investigating Manhattan's top 6 neighbourhoods (and room types)
AIRBnB_Manhattan <- AIRBnB %>%
filter(neighbourhood_group=="Manhattan",neighbourhood=="East Village"|
neighbourhood=="Harlem"|neighbourhood=="Hell's Kitchen"|
neighbourhood=="Midtown"|neighbourhood=="Upper East Side"|
neighbourhood=="Upper West Side")
ggplot(AIRBnB_Manhattan, aes(neighbourhood)) +
geom_bar(aes(fill=room_type)) + coord_flip() +
labs(x="Number of Listings", y="Top Neighbourhoods in Manhattan", title="Six Most Listing-Dense Neighbourhoods in Manhattan")

AIRBnB_Manhattan <- AIRBnB_Manhattan %>% filter(price<500)
ggplot(AIRBnB_Manhattan, aes(x=neighbourhood,y=price,fill=factor(room_type))) +
geom_boxplot() +
labs(x="Neighbourhood", y="Price ($)", title="Median Price of Top Manhattan Neighbourhoods")

# Investigating Brooklyn's top 4 neighbourhoods (and room types)
AIRBnB_Brooklyn <- AIRBnB %>%
filter(neighbourhood_group=="Brooklyn",neighbourhood=="Williamsburg"|
neighbourhood=="Crown Heights"|neighbourhood=="Bushwick"|
neighbourhood=="Bedford-Stuyvesant")
ggplot(AIRBnB_Brooklyn, aes(neighbourhood)) +
geom_bar(aes(fill=room_type)) + coord_flip()+
labs(x="Number of Listings", y="Top Neighbourhoods in Brooklyn", title="Four Most Listing-Dense Neighbourhoods in Brooklyn")

AIRBnB_Brooklyn <- AIRBnB_Brooklyn %>% filter(price<500)
ggplot(AIRBnB_Brooklyn, aes(x=neighbourhood,y=price,fill=factor(room_type))) +
geom_boxplot() +
labs(x="Neighbourhood", y="Price ($)", title="Median Price of Top Brooklyn Neighbourhoods")

AIRBnB2 <- read.csv("AIRBnB_NYC.csv")
singleMB <- AIRBnB2 %>%
filter(room_type=="Entire home/apt", neighbourhood_group=="Manhattan"|neighbourhood_group=="Brooklyn")
AIRBnB_priceluxury1 <- AIRBnB2 %>% filter(price>350&price<=500)
ggplot(AIRBnB_priceluxury1, aes(price, na.rm = T)) +
geom_histogram(aes(y = ..density..),fill = "green", col= "red", alpha = 0.2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AIRBnB_priceluxury2 <- AIRBnB2 %>% filter(price>500&price<=1000)
ggplot(AIRBnB_priceluxury2, aes(price, na.rm = T)) +
geom_histogram(aes(y = ..density..),fill = "green", col= "red", alpha = 0.2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AIRBnB_priceluxury3 <- AIRBnB2 %>% filter(price>1000&price<=2000)
ggplot(AIRBnB_priceluxury3, aes(price, na.rm = T)) +
geom_histogram(aes(y = ..density..),fill = "green", col= "red", alpha = 0.2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AIRBnB_priceluxury4 <- AIRBnB2 %>% filter(price>2000)
ggplot(AIRBnB_priceluxury4, aes(price, na.rm = T)) +
geom_histogram(aes(y = ..density..),fill = "green", col= "red", alpha = 0.2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Airb <-singleMB %>% filter(price<350)
Outliers <-singleMB %>% filter(price>350)
ggplot(Airb, aes(x=neighbourhood_group, y=price)) + geom_boxplot()

#Trend
#Availability-Manhattan outliers have higher number of days when listing is available for booking, Brookyln is the same
ggplot(Outliers, aes(x=neighbourhood_group, y=availability_365)) +geom_boxplot()

ggplot(Airb, aes(neighbourhood_group, y = availability_365)) + geom_boxplot()

#Number of reviews- Outliers have a lot less reviews
ggplot(Outliers, aes(x=neighbourhood_group, y=number_of_reviews)) +geom_boxplot()

ggplot(Airb, aes(neighbourhood_group, y = number_of_reviews)) + geom_boxplot()

#minimum nights- No trend
ggplot(Outliers, aes(x=neighbourhood_group, y=minimum_nights)) +geom_boxplot()

ggplot(Airb, aes(neighbourhood_group, y = minimum_nights)) + geom_boxplot()

#reviews per month- outliers have less reviews per month
ggplot(Outliers, aes(x=neighbourhood_group, y=reviews_per_month)) +geom_boxplot()
## Warning: Removed 680 rows containing non-finite values (stat_boxplot).

ggplot(Airb, aes(neighbourhood_group, y = reviews_per_month)) + geom_boxplot()
## Warning: Removed 3835 rows containing non-finite values (stat_boxplot).

#calculated list hostings- outliers hosts have less listings per host
ggplot(Outliers, aes(x=neighbourhood_group, y=calculated_host_listings_count)) +geom_boxplot()

ggplot(Airb, aes(neighbourhood_group, y = calculated_host_listings_count)) + geom_boxplot()
